Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce native Readers read flatValues directly from faiss file. #2267

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

luyuncheng
Copy link
Collaborator

ISSUE: #2266

Faiss code: faiss/impl/index_write.cpp


|-typeIDMap   -|-id_header-|
|-typeHnsw    -|-hnsw_header-|-hnswGraph-|
|-typeStorage -|-storage_Header-|-storageVector-|
|-idmap_vector-|-FOOTER_MAGIC+CHECKSUM-|

AND IndexHeader like:

|dim|ntotal|dummy|dummy|is_trained|metric_type|metric_arg|

Example for HNSW32,Flat, the whole file like:

|idMapType|idMapHeader|hnswType|hnswHeader|hnswGraph|flatType|flatHeader|Vectors|IdVector|FOOTER_MAGIC+CHECKSUM|

so we can read the file from directly like this PR

Signed-off-by: luyuncheng <[email protected]>

@Override
public void checkIntegrity() throws IOException {

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intentionally kept no-op?

return fieldFileMap.containsKey(field) && fieldMetaMap.containsKey(field);
}

private MetaInfo read_index_header(IndexInput in) throws IOException {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: camelCase?

continue;
}
// TODO if not exist file, change to lucene flatVector
IndexInput in = state.directory.openInput(vectorIndexFileName, state.context.withRandomAccess());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we want to give access to entire file through flat vectors reader. Is it possible to limit the access to using IndexInput#slice and then storing it in the map?

this.metaInfo = metaInfo;
this.slice = input.clone();
this.similarityFunction = getVectorSimilarityFunction(metaInfo.metricType).getVectorSimilarityFunction();
this.flatVectorsScorer = FlatVectorScorerUtil.getLucene99FlatVectorsScorer();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

@Override
public float[] vectorValue() throws IOException {
if (ord % BUCKET_VECTORS == 0) {
readBucketVectors();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to understand the thought process here, why do we need to read 64 vectors at once? we can seek to the position based on ordinal and read the required vector right?

import java.io.IOException;

/** Reads vectors from faiss index. like Lucene KnnVectorsReader without search */
public abstract class FaissEngineKnnVectorsReader implements Closeable {
Copy link
Collaborator

@shatejas shatejas Dec 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am wondering why a new interface was created, why not use existing FlatVectorReader? This way we can create a KNNFlatVectorFormat and encapsulate when to use FAISS vs Lucene flat readers logic at one place. That removes the need for methods like isNativeVectors(String field) to be exposed and called by clients

Same goes for writer, instead of having if else condition in NativeEngines990KnnVectorsWriter#flush we can abstract the if else in the codec writer

@shatejas
Copy link
Collaborator

Please fix the CI

  • Any BWC issues that we need to look out for?
  • Changes on the indexing side seemed to be pending, are you planning to have a separate PR? ideally we should do it in this PR
  • Unit tests are missing, there are significant number of if - continue conditions, please add them
  • Please add benchmarking results

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants